IODS course project

Matti Lassila



About the project

IODS (Introduction to Open Data Science) is a MOOC offered by University of Helsinki. My goal is to learn new R tips and tricks and to become more organized and productive in data analysis tasks.

Source of this learning diary is available at my Github repo.

Last updated: 02.12.2018


Week 2 – Regression and model validation

The following analysis uses dataset (n=166) gathered using ASSIST: the appproaches and study skills inventory for studets -survey. For more information regarding the dataset, please see Vehkalahti (2016).

The purpose of the analysis is to explore, whether is it possible to predict student’s test performance using composite variables depicting student’s mental inclination towards studying, namely:

  1. positive attitude
  2. desire to learn deeply
  3. desire to learn quickly without commintment to truly understand the subject matter.
  4. strategic mentality torwards studying

These four composite variables were created from the set of 42 variables. Original variables were measured using five-point Likert scale and composite variables were scaled back to the five-point range.

Exploratory analysis

From the explatory graph we can see that the gender and age distribution is rather skewed in the data. There are almost 50% more female students than males – especially among the students in their twenties. Even though the mean age is 25.51, there are lots of older students too, the maximum age being 55.

Visual inspection indicates that even though there are slight differences in the attitude and learning approch scores (deep, surface, strategic) between males and females it seems that these differences do not have an effect on median test points. From the figure below we can see that the bulk of the data sits right in the middle of the scale so that there are very few datapoints in the ends of the scales.

Relationship of gender, attitude and strategic study approach

Regression analysis

In building linear regression models, it is desirable to have high correlations between the prediction covariates and the response variable, but small correlations between the different prediction covariates. Large correlations between prediction variables leads to the problem of collinearity in linear regression, eg. the model might overfit the data.

Based on explorative analysis in Fig 1, three variables were chosen as prediction covariates.

For the first model, the following variables were chosen as predictors.

  • attitude (correlation with points, 0.437)
  • strategic learning approach (correlation with points, 0.146)
  • surface learning approach (correlation with points, -0.144)

Attitude and strategic learning approach do not correlate (0.06), strategic learning and surface learning approach correlate slightly (-0.16) as well as surfarce learning approach and attitude (-0.17).

Initial model

Regression summary
Predictor Estimate Confidence Interval (95%) t-statistic p-value
Intercept 11.02 \([3.74\), \(18.29]\) 2.99 .003
Attitude 3.40 \([2.26\), \(4.53]\) 5.91 < .001
Stra 0.85 \([-0.22\), \(1.92]\) 1.58 .117
Surf -0.59 \([-2.17\), \(1.00]\) -0.73 .466

F Test summary: (R2=0.2074, F(3,162)=14.13, p<1e-05). Adjusted R2 is 0.19.

R2 is the percentage of the dependent variable variation explained by the linear model. High R-squared values represent smaller differences between the observed data and the fitted values. Having R2 = 0.2074, our first model explains 20% of the observed variation in student’s test scores.

From the summary we can see, that the surface learning variable (surf) doesn’t have statistically significant relationship with test points (p=0.446). Neither Strategic learning approach -variable (stra) is significant (p=0.177) so we’ll exclude them from the final model.

Final model

Regression summary
Predictor Estimate Confidence Interval (95%) t-statistic p-value
Intercept 11.64 \([8.02\), \(15.25]\) 6.36 < .001
Attitude 3.53 \([2.41\), \(4.65]\) 6.21 < .001

For the final model F Test summary: (R2=0.1906, F(1,164)=38.61, p<1e-05). Adjusted R2 is 0.19.

We see that R2 stays the same for the revised model. Based on the model coefficients (estimate) and 95% confidence interval for the estiate, we see that for the one unit increase in attitude (predictor) variable, the increase in points is between 2.41 and 4.65, average increase being 3.53 points.

Regression diagnostics

Regression diagnostics plots help to observe, if the model meets the assumptions of linear regression:

  • Linearity of residuals
  • Independence of residuals
  • Normal distribution of residuals
  • Equal variance of residuals

Residuals vs Fitted -plot shows if residuals have non-linear patterns. If the residuals have constant variance of errors, the plot scatters randomly without any distict patterns. We don’t see patterns here, so we can conclude that the variance is constant and the assumption of equal variance of residuals is met.

Normal Q-Q -plot shows if residuals are normally distributed. If the residuals deviate from the straight line it is an indication of non-normality. There are few outlying values in both ends of the line but majority of of the points line nicely and therefore we can conclude that the the residuals of the model are normally distributed.

Residual vs. leverage -plot helps to find influential cases. Not all outliers have leverage in linear regression analysis. Values to observe are at the upper right corner or at the lower right corner – they are the ones which can have an effect on regression results. We don’t see outlying values in these critical spots, so we can conclude that there is no influencial individual cases in the data.

Extra - tweaking the model

Even though we just concluded that there’s no influencial individual cases in the data, we can see from the plots that there are few outlying cases in non-critical spots of the Residual vs. leverage plot. What happens if we exclude them from the data? To do it, we have to single out the cases from the data using Cook’s distance.

Just for the sake of curiosity, let’s investige how excluding three outlier cases (36, 56 & 71) has an effect on our regression model.

Regression summary
Predictor Estimate Confidence Interval (95%) t-statistic p-value
Intercept 10.24 \([6.81\), \(13.68]\) 5.89 < .001
Attitude 4.01 \([2.95\), \(5.08]\) 7.43 < .001

For the model created using data without cases 35, 56, and 71, the F Test summary: (R2=0.2555, F(1,161)=55.26, p<1e-05). Adjusted R2 is 0.25.

The fit of the model seems to be marginally better when three outlying cases are excluded. Let’s see from the data what kind of cases these are so that we could decide if it is prudent to consider them outliers.

Outlier cases
Gender Age Attitude Deep Strategic Surface Points
M 20 4.2 4.50 3.25 1.58 10
F 45 3.8 3.00 3.12 3.25 9
F 23 1.9 4.33 2.75 2.92 29

Looking the data, two unfortunate students have scored very low in the test, even though their motivation is strong. It might be that some external cause has affected their test performance as other low-scoring (points <=10) students somewhat differ from these cases. There is also one student whose attitude score is very low but who has achieved high points in tests.

Outlier students
Gender Age Attitude Deep Strategic Surface Points
M 53 3.5 3.50 3.12 2.25 10
M 29 1.7 4.08 3.00 3.75 9
F 37 2.3 3.67 2.75 2.42 9
M 20 4.2 4.50 3.25 1.58 10
F 45 3.8 3.00 3.12 3.25 9
F 21 3.5 3.92 3.88 3.50 7

Week 3 – Logistic regression

Our goal if to predict the level of alcohol usage based on data collected using school reports and questionnaires from two Portugese secondary-level schools. The original dataset (n=382) has the following variables:

  • school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
  • sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
  • age - student’s age (numeric: from 15 to 22)
  • address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
  • famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
  • pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
  • medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
  • fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
  • mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
  • fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
  • reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
  • guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
  • traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
  • studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
  • failures - number of past class failures (numeric: n if 1<=n<3, else 4)
  • schoolsup - extra educational support (binary: yes or no)
  • famsup - family educational support (binary: yes or no)
  • paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
  • activities - extra-curricular activities (binary: yes or no)
  • nursery - attended nursery school (binary: yes or no)
  • higher - wants to take higher education (binary: yes or no)
  • internet - Internet access at home (binary: yes or no)
  • romantic - with a romantic relationship (binary: yes or no)
  • famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
  • freetime - free time after school (numeric: from 1 - very low to 5 - very high)
  • goout - going out with friends (numeric: from 1 - very low to 5 - very high)
  • dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
  • walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
  • health - current health status (numeric: from 1 - very bad to 5 - very good)
  • absences - number of school absences (numeric: from 0 to 93) *
  • g1 - first period grade (numeric: from 0 to 20)
  • g2 - second period grade (numeric: from 0 to 20)
  • g3 - final grade (numeric: from 0 to 20, output target)

In addition, the following variables were created for the analysis

  • alc_use - mean of workday and weekend alcohol consumption (numeric)
  • high_use - binary variable, TRUE if alc_use > 2

Numerical abseentism score was transformed to five-class categorical variable (0–1, 2, 3–4, 5–7, >8) for the purpose of the analysis.

Because there are as many as 27 variables in the final dataset, choosing interesting and hopefully fruitful variables just by guessing is not likely to succeed. Therefore, let’s use Goodman-Kruskal tau to explore associations between variables in the dataset.

We have to use tau for this, instead of familiar Pearson correlation, as it doesn’t give us useful results when variables are not continuous. It should be noted, that even though many of the variables in the dataset are numerical (such as studytime, frequency of going out and amount of abseentism), according to the dataset description they are actually ordered categorical variables and therefore using tau to study association is sensible choice.

It can be seen from the initial Goodman-Kruskal tau-matrix that the association between heavy drinking and other variables is strongest between frenquency of going out with friends, amount of abseentism, time dedicated for studying and gender. It should be noted that the association with gender is as strong as with freetime, but intuitively if feels more reasonable study the association of gender to the drinking habits than the amount of free time.

Based on observed associations in the data and personal experience, a reasonable initial hypothesis would be:

Going often out with friends and is correlated with high drinking. Gender doesn’t make signinificant difference, even tough male students tend to drink more. Time dedicaded to studies is inversely associated with high alcohol use but igh abseentism has no significant association with heavy drinking

Next we’ll observe the relationship of variables using mosaic plots, which allow us to explore the data in multiple dimensions at the same time.

Looking the plots it seems that our initial hypothesis was somewhat inaccurate. Differences between genders are at least visually observing large. Going often out with friends is indeed associated with high level of alcohol use, especially when the abseentism is common (top right corner of the first plot). On the other hand, the size of this group is not very large, so this three-way association might not be strong. Time dedicated to studies is inversely correlated with high alcohol use and amount of abseentism (top right corner).

Modeling

Initial model

## 
## Call:
## glm(formula = high_use ~ goout + sex + absnt + studytime, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0011  -0.7391  -0.4707   0.7279   2.6452  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -2.34971    0.69745  -3.369 0.000754 ***
## gooutlow         0.39022    0.70228   0.556 0.578449    
## gooutmod         0.63488    0.69198   0.917 0.358895    
## goouthigh        2.16469    0.69593   3.110 0.001868 ** 
## gooutvhigh       2.38300    0.71257   3.344 0.000825 ***
## sexM             0.76948    0.27549   2.793 0.005220 ** 
## absnt2           0.41615    0.41509   1.003 0.316074    
## absnt3–4         0.03195    0.38525   0.083 0.933897    
## absnt5–7        -0.20080    0.42637  -0.471 0.637679    
## absnt>8          1.05424    0.36589   2.881 0.003961 ** 
## studytime2–5 h  -0.44563    0.30751  -1.449 0.147302    
## studytime5–10 h -0.91718    0.48515  -1.891 0.058689 .  
## studytime>10 h  -1.01073    0.62811  -1.609 0.107583    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 374.51  on 369  degrees of freedom
## AIC: 400.51
## 
## Number of Fisher Scoring iterations: 4

For our initial variables, frequency of going out with friends, gender, level of abseentism and time dedicated to studies make a statistically significant effect on level of drinking

Looking the odds ratios of the model, it can be stated that being male increases the odds of high alcohol usage over 2 times compared to females, going out frequently or very frequently increses the odds 8-10 times compared to the students who go out very little or not at all. High abseentism also increases the odds of high alcohol consumption 2.87 times compared to the studets who are not in the habit of abseentism.

These model-based observations align pretty well on the conclusions which were made based on the initial visual exploration of the data.

Predictor Estimate CI (95%) t p-value Odds ratio
gooutmod 0.63 \([-0.61\), \(2.18]\) 0.92 .359 1.89
goouthigh 2.16 \([0.92\), \(3.72]\) 3.11 .002 8.71
gooutvhigh 2.38 \([1.10\), \(3.97]\) 3.34 .001 10.84
sexm 0.77 \([0.23\), \(1.32]\) 2.79 .005 2.16
abseentism>8 1.05 \([0.34\), \(1.78]\) 2.88 .004 2.87
studytime5–10 h -0.92 \([-1.92\), \(0.00]\) -1.89 .059 0.40

Final model

## 
## Call:
## glm(formula = high_use ~ goout + sex + absnt, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9016  -0.7054  -0.4885   0.6704   2.4222  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.72000    0.67394  -4.036 5.44e-05 ***
## gooutlow     0.20910    0.69706   0.300 0.764194    
## gooutmod     0.42097    0.68290   0.616 0.537598    
## goouthigh    1.99006    0.68527   2.904 0.003684 ** 
## gooutvhigh   2.24070    0.70637   3.172 0.001513 ** 
## sexM         0.94468    0.26159   3.611 0.000305 ***
## absnt2       0.44525    0.41055   1.085 0.278133    
## absnt3–4     0.09012    0.37728   0.239 0.811208    
## absnt5–7    -0.15888    0.41776  -0.380 0.703710    
## absnt>8      1.16366    0.36153   3.219 0.001288 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 379.98  on 372  degrees of freedom
## AIC: 399.98
## 
## Number of Fisher Scoring iterations: 4
Predicted vs. actual alcohol use (counts)
predicted low predicted high
low 245 23
high 58 56
Predicted vs. actual alcohol use (%)
predicted low predicted high Sum
low 64.1 6.0 70.1
high 15.2 14.7 29.9
Sum 79.3 20.7 100.0

Crosstables indicate that model tends to overpredict the proportion of students who are low-drinkers (actual) and underpredict the proportion of high-drinking students.

Predictor Estimate CI (95%) t p-value Odds ratio
goouthigh 1.99 \([0.76\), \(3.53]\) 2.90 .004 7.32
gooutvhigh 2.24 \([0.97\), \(3.81]\) 3.17 .002 9.40
sexm 0.94 \([0.44\), \(1.47]\) 3.61 < .001 2.57
abseentism>8 1.16 \([0.46\), \(1.88]\) 3.22 .001 3.20

For the final model, the odds ratio of high alcohol use increasse 2.57 times when the student is male, 3.20 times when number of school absensces is higher than 8 (compared to zero to one absences), and 7.32-9.40 times when the student goes out with friends frequently or very frequently, compared to going out very little or no at all.

Performance comparison

The model has 0.212 training error. Error for initial model is 0.204 so it seems that keeping studytime predictor in the model would have been useful.

Based on the 10-fold cross-validation, the prediction error for the model is 0.22 so it seems that this model performs a bit better than the model introduced in the DataCamp which had 0.26 error. Surprisingly, prediction error for the initial model is 0.22, so it is the best model of all three. To conlude, this somewhat unexpected result idicates that it might be useful to keep also marginally signitficant predictors in the model, as was the case with studytime predictor.

References


Week 4 – Clustering

Our goal for the fourth week is to explore census data collected in 1970 from the Boston metropolitan area. Our first step is to try to predicting the crime rate in the Boston area using plethora of predictor variables. Finally we’ll see what kind of clusters we can find from the dataset when running k-means clustering algorithm over the data.

The dataset is discussed more deeply in Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management 5, 81–102 (PDF)

## Skim summary statistics
##  n obs: 506 
##  n variables: 14 
## 
## ── Variable type:integer ────────────
##  variable missing complete   n  mean   sd p0 p25 p50 p75 p100     hist
##      chas       0      506 506 0.069 0.25  0   0   0   0    1 ▇▁▁▁▁▁▁▁
##       rad       0      506 506 9.55  8.71  1   4   5  24   24 ▂▇▁▁▁▁▁▅
## 
## ── Variable type:numeric ────────────
##  variable missing complete   n   mean     sd       p0     p25    p50
##       age       0      506 506  68.57  28.15   2.9     45.02   77.5 
##     black       0      506 506 356.67  91.29   0.32   375.38  391.44
##      crim       0      506 506   3.61   8.6    0.0063   0.082   0.26
##       dis       0      506 506   3.8    2.11   1.13     2.1     3.21
##     indus       0      506 506  11.14   6.86   0.46     5.19    9.69
##     lstat       0      506 506  12.65   7.14   1.73     6.95   11.36
##      medv       0      506 506  22.53   9.2    5       17.02   21.2 
##       nox       0      506 506   0.55   0.12   0.39     0.45    0.54
##   ptratio       0      506 506  18.46   2.16  12.6     17.4    19.05
##        rm       0      506 506   6.28   0.7    3.56     5.89    6.21
##       tax       0      506 506 408.24 168.54 187      279     330   
##        zn       0      506 506  11.36  23.32   0        0       0   
##     p75   p100     hist
##   94.07 100    ▁▂▂▂▂▂▃▇
##  396.23 396.9  ▁▁▁▁▁▁▁▇
##    3.68  88.98 ▇▁▁▁▁▁▁▁
##    5.19  12.13 ▇▅▃▃▂▁▁▁
##   18.1   27.74 ▃▆▅▁▁▇▁▁
##   16.96  37.97 ▆▇▆▅▂▁▁▁
##   25     50    ▂▅▇▆▂▂▁▁
##    0.62   0.87 ▇▆▇▆▃▅▁▁
##   20.2   22    ▁▂▂▂▅▅▇▃
##    6.62   8.78 ▁▁▂▇▇▂▁▁
##  666    711    ▃▇▂▅▁▁▁▆
##   12.5  100    ▇▁▁▁▁▁▁▁

The dataset has 506 observations and 14 variables. My guess is that observations correspond to cencus districts, eg. subportions of town, as it is very unlikely that Boston metropolitan has over 500 towns.

de De scription
crim per capita crime rate by town.
zn proportion of residential land zoned for lots over 25,000 sq.ft.
indus proportion of non-retail business acres per town.
chas Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox nitrogen oxides concentration (parts per 10 million).
rm average number of rooms per dwelling.
age proportion of owner-occupied units built prior to 1940.
dis weighted mean of distances to five Boston employment centres.
rad index of accessibility to radial highways.
tax full-value property-tax rate per $10,000.
tratio pupil-teacher ratio by town.
black 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat lower status of the population (percent).
medv median value of owner-occupied homes in $1000s.

Renaming variables makes it easier in later steps to discuss our findings.

Linear discriminant analysis and prediction

The model predicts the correct class with 73 % accuracy. Location near to highway entraces seems to contribute most to the criminality.

Predicted (cols) vs. actual (rows) crime rate in Boston
very low low high very high
very low 13 10 2 0
low 8 15 2 0
high 1 11 11 2
very high 0 0 1 24

K-means clustering

Plot represents the variance within the clusters. It decreases as k increases, but one can notice a bend (or “elbow”) at ~2. This bend indicates that additional clusters beyond this point have only little value. So, let’s fit k-means clustering with k=2.

Looking the plots, our two clusters correspond pretty nicely to our intuition of crime-prone neighbourhoods residing near high-pollution industrial areas and highways. More affluent cluster 1 (colour black) has less criminality. Cluster 2 has more pollution, is located near to highways, has factories nearby and has more criminality.

Affulent and less affulent neighbourhoods of Boston metropolitan area
cluster proportion_industry per_capita_crime nox_concentration proportion_old_houses highway_access tax_value lower_status
1 -0.6146857 -0.3894158 -0.5823397 -0.4331555 -0.5828749 -0.6291043 -0.4530491
2 1.1425514 0.7238295 1.0824279 0.8051309 1.0834228 1.1693521 0.8421083

References


Week 5 – Dimensionality reduction techniques

This week we’ll try out dimensionality reduction techniques, namely principal components analysis (PCA) and multiple correspondence analysis (MCA) which can be used to explore multidimensional data as points in a low-dimensional space. PCA is applicable for continous data and MCA can be used when the measurement level of variables is at least categorical.

Exploring human development data with principal component analysis

The dataset we’ll use is from the United Nations Development Programme. For the following analysis, the dataset was preprocessed using an external script.

The dataset has the following variables:

de De scription
country name of the country (as row name)
edu_exp expected years of education
sec_edu_female proportion of females at least 2nd level education
labour_rate_female proportion of women participating in working life
life_exp life expectancy at birth
gni gross national income
mat_mortality maternal mortality ratio
birth_rate the number of births to women ages 15–19 (per 1000 women)
repr_parlament percetange of female members of the parliament

Most of the variable distributions are more-or-less skewed, so it is very likely that PCA will not work without normalizing the data. Only the edu_exp (expected years of education) variable looks normally distributed.

Statistic Min Pctl(25) Median Pctl(75) Max Median St. Dev.
edu_exp 5,40 11,25 13,50 15,20 20,20 13,50 2,84
sec_edu_female 0,90 27,15 56,60 85,15 100,00 56,60 30,93
labour_rate_female 13,50 44,45 53,50 61,90 88,10 53,50 16,04
life_exp 49,00 66,30 74,20 77,25 83,50 74,20 8,33
gni 581 4197,5 12040 24512 123124 12040 18543,85
mat_mortality 1 11,5 49 190 1100 49 211,79
birth_rate 0,60 12,65 33,60 71,95 204,80 33,60 41,11
repr_parlament 0,00 12,40 19,30 27,95 57,50 19,30 11,49

Unexpectedly, maternal mortality and life expectency are highly correlated (-0.86) as well as high birth rate and maternal mortality (0.76). Variables regarding education have also strong correlation with maternal mortality, so that higher values in education variables, lead to lower value in maternal mortality rate. Secondary education of women have an association to lower birth rate (-0.68), as well as higher life expectancy (-0.72). Quite surprisingly, women’s representation in the parliament doesn’t seem to have an association with any of other variables or the association is very low. As one might expect, maternal mortality and high birth rate are associated with low GNI.

First round of PCA

Principal component analysis of non-standardized data
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
Standard deviation 18544.2 186.3 26.0 20.1 14.3 10.6 3.7 1.4
Eigenvalue 343886317.3 34701.6 674.6 403.0 205.1 113.1 13.8 2.0
Proportion of Variance 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Cumulative Proportion 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

As expected, PCA with non-standardized data doesn’t make much sense as it’ll happen that the variable will largest variance explains virtually all of the variance and therefore creating a biplot will fail miserably.

Second round of PCA

Principal component analysis of standardized data
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
Standard deviation 2.14 1.12 0.88 0.72 0.55 0.53 0.45 0.34
Eigenvalue 4.56 1.25 0.77 0.52 0.31 0.28 0.20 0.11
Proportion of Variance 0.57 0.16 0.10 0.06 0.04 0.04 0.03 0.01
Cumulative Proportion 0.57 0.73 0.82 0.89 0.93 0.96 0.99 1.00

After scaling, PCA results will make sense. Because in the original data, variance and distribution vary wildly, using PCA in non-standardized data will yield strange results as PCA as a analysis technique is sensitive to to the relative scaling of variables.

The first principal component explains 57.03% of the total variance of the dataset, and the second principal component explains 15.6% of it. It seems that the first principal component represents classical welfare state, eg. healthcare and access to education. The second principal component represents gender equality, eg. do women have equal rights to participate in politics and working life. Looking the biplot, it is evident that having an all-embracing welfare state doesn’t necessarily make a country gender-equal and vice-versa, Qatar (and other oil-rich states) and Rwanda exemplifying this phenomenon.

Our initial exploration of the data using a correlation matrix revealed the complex relationship between variables partially but usage of PCA biplot makes it possible to understand the whole dataset in an intuitive way almost at a glance.

Exploring tea survey data with multiple correspondence analysis

The following questions which were used to gather the tea dataset can be found from the book Exploratory Multivariate Analysis by Example Using R, Second Edition, pages 131–132.

  1. What kind of tea do you drink the most (black tea, green tea, flavored tea)
  2. How do you take your tea (nothing added, with lemon, with milk, other)
  3. What kind of tea you buy ? (tea bags, loose tea, both)
  4. Do you add sugar to your tea (yes, no)?
  5. Where do you buy your tea (supermarket, specialist shop, both)
  6. What kind of tea you buy (cheapest supermarket brand, well-known brand, upscale, it varies)
  7. How ofter do you drink tea (more than twice a day, once a day, 3 to 6 times a week, once or twice per week)?

Location in wich the tea is drunk (yes/no):

  1. Do you drink tea at home?
  2. Do you drink tea at work?
  3. Do you drink tea in tearooms or coffee shops?
  4. Do you drink tea at frinds houses?
  5. Do you drink tea in restaurants?
  6. Do you drink tea in bars?

Time of the day (yes/no):

  1. Do you drink tea at breakfast?
  2. Do you drink tea in the afternoon?
  3. Do you drink tea in the evening?
  4. Do you drink tea after lunch?
  5. Do you drink tea after dinner?
  6. Do you drink tea throughout the day?

Questions about the image of the product (yes/no):

20 Do you consider tea to be exotic? 21. Do you associate tea with spirituality? 22. Is tea good for your health? 23. Is tea a diuretic? 24. Do you associate tea with friendliness? 25. Does tea stop the body from absorbing iron? 26. Is tea feminine? 28. Is tea refined? 29. Will tea help you to lose weight? 30. Is tea a stimulant? 31. Is tea a relaxant? 32. Does tea have any effect on your overall health?

Additionally, there were four background questions regarding:

  • sex
  • professional category (farmer, manual labourer, professional, senior management, employee, other profession, student, unemployed)
  • age
  • regular participation in sports (yes or no).

Skim summary statistics
n obs: 300
n variables: 36

Variable type: factor
variable missing complete n n_unique top_counts ordered
age_Q 0 300 300 5 15-: 92, 25-: 69, 45-: 61, 35-: 40 FALSE
always 0 300 300 2 Not: 197, alw: 103, NA: 0 FALSE
breakfast 0 300 300 2 Not: 156, bre: 144, NA: 0 FALSE
dinner 0 300 300 2 Not: 279, din: 21, NA: 0 FALSE
diuretic 0 300 300 2 diu: 174, Not: 126, NA: 0 FALSE
effect.on.health 0 300 300 2 No.: 234, eff: 66, NA: 0 FALSE
escape.exoticism 0 300 300 2 Not: 158, esc: 142, NA: 0 FALSE
evening 0 300 300 2 Not: 197, eve: 103, NA: 0 FALSE
exciting 0 300 300 2 No.: 184, exc: 116, NA: 0 FALSE
feminine 0 300 300 2 Not: 171, fem: 129, NA: 0 FALSE
frequency 0 300 300 4 +2/: 127, 1/d: 95, 1 t: 44, 3 t: 34 FALSE
friendliness 0 300 300 2 fri: 242, Not: 58, NA: 0 FALSE
friends 0 300 300 2 fri: 196, Not: 104, NA: 0 FALSE
healthy 0 300 300 2 hea: 210, Not: 90, NA: 0 FALSE
home 0 300 300 2 hom: 291, Not: 9, NA: 0 FALSE
how 0 300 300 3 tea: 170, tea: 94, unp: 36, NA: 0 FALSE
How 0 300 300 4 alo: 195, mil: 63, lem: 33, oth: 9 FALSE
iron.absorption 0 300 300 2 Not: 269, iro: 31, NA: 0 FALSE
lunch 0 300 300 2 Not: 256, lun: 44, NA: 0 FALSE
price 0 300 300 6 p_v: 112, p_b: 95, p_u: 53, p_p: 21 FALSE
pub 0 300 300 2 Not: 237, pub: 63, NA: 0 FALSE
relaxing 0 300 300 2 rel: 187, No.: 113, NA: 0 FALSE
resto 0 300 300 2 Not: 221, res: 79, NA: 0 FALSE
sex 0 300 300 2 F: 178, M: 122, NA: 0 FALSE
slimming 0 300 300 2 No.: 255, sli: 45, NA: 0 FALSE
sophisticated 0 300 300 2 sop: 215, Not: 85, NA: 0 FALSE
SPC 0 300 300 7 stu: 70, non: 64, emp: 59, mid: 40 FALSE
spirituality 0 300 300 2 Not: 206, spi: 94, NA: 0 FALSE
Sport 0 300 300 2 spo: 179, Not: 121, NA: 0 FALSE
sugar 0 300 300 2 No.: 155, sug: 145, NA: 0 FALSE
Tea 0 300 300 3 Ear: 193, bla: 74, gre: 33, NA: 0 FALSE
tea.time 0 300 300 2 tea: 169, Not: 131, NA: 0 FALSE
tearoom 0 300 300 2 Not: 242, tea: 58, NA: 0 FALSE
where 0 300 300 3 cha: 192, cha: 78, tea: 30, NA: 0 FALSE
work 0 300 300 2 Not: 213, wor: 87, NA: 0 FALSE
Variable type: integer
variable missing complete n mean sd p0 p25 p50 p75 p100 hist
age 0 300 300 37.05 16.87 15 23 32 48 90 ▇▆▃▅▂▂▁▁

Let’s use Goodman-Kruskal tau to explore associations between variables in the dataset. Goodman-Kruskal tau can be used to study association when the measurement level of the variable is at least categorical.

Going trough the Goodman-Kruskal tau-matrix, we can observe few interesting associations between variables. Overall, associations are not very strong, but the following associations emerge when taking into account associations which are stronger than the mean tau-value (0.08).

  • gender and considering tea feminine or not feminine
  • age and considering tea feminine or not feminine
  • age and a belief that tea has an effect on overall health
  • a belief that tea has an effect on overall health and and considering tea healthy
  • type of the tea and age
  • type of the tea and habit of using sugar
  • drinking tea at tearoom and buying tea from specialist shops
  • frequency of drinking tea and drinking tea at breakfast
  • price, type of tea and the place of purchase

Even though we’ll eventually use all variables in the MCA, let’s observe few of these associations bit more closely, just for the sake of curiosity.

  • There are more female than male responders to the survey. Over the half of the responders are under 35.
  • It seems that the bulk of the tea is being bought from the chain stores and the preferred style is Earl Gray.
  • Interestingly, considering all answers tea is not considered feminine, but females tend to associate tea with femininity more often than men.

Drinkers of different types of tea

Let’s use MCA to investigate if there are observable differences between groups of people who prefer different types of tea.

Looking the biplot graph, drinkers of different types of tea seem to somewhat differ. People who drink green and black tea are more often buying their tea from specialist shops, prefer unpackaged tea and dont’t mind spending more money. Drinking Earl Gray seems to serve similiar purpose as habitual coffee drinking in Finland – people drink it at work or with friends during lunch.

It should be kept in mind that the proportion of green tea drinkers is very small compared to the friends of Earl Gray so that these conclusions based on biplot are likely not very strong.

Extra plots

The following interactive plot makes it possible to investigate which responses have the strongest contribution to the model. Darker the colour, stronger the contribution.

Scree plot cat be used to visualize the percentage of inertia (eg. in this context, variance) explained by MCA dimensions. From the plot we can see that the first dimension explains approximately 10% percent of variance and the second one 8%.

Vehkalahti, Kimmo. 2016. “The Relationship Between Learning Approaches and Students’ Achievements in an Introductory Statistics Course in Finland.” In Proceedings of the 60th ISI World Statistics Congress of the International Statistical Institute, ISI2015, 68–73. http://hdl.handle.net/10138/163015.